(54) Client-server distributed speech recognition

(57) A speech communication system and method for utilization on a communications network system, such as the Internet, comprising a plurality of acoustic recognizers embedded in the mobile electronic communication devices for recognizing speech information and generating a first set of associated language information, and a plurality of linguistic recognizers embedded in data processing devices, such as servers, for recognizing said first set of associated language information and generating a second set of associated language information, thereby more accurately recognizing the speech information in a distributed speech recognition processing manner.

Description

[0001] This invention relates generally to speech recognition information transmission and more specifically to speech recognition communication between a remote mobile electronic device and a computer through the Internet.

[0002] Transmission of information from humans to machines has traditionally been achieved through manually operated keyboards, which presupposes machines having dimensions at least as large as the comfortable finger-spread of two human hands. With the advent of electronic devices requiring information input but which are smaller than traditional personal computers, information input began to take other forms, such as pen pointing, touchpads, and voice commands. The information capable of being transmitted by pen pointing and touchpads is limited by the display capabilities of the device (such as personal digital assistants (PDAs) and cell phones). Therefore, significant research effort has been devoted to speech recognition systems for electronic devices. Among the approaches to speech recognition by machine is for the machine to attempt to decode a speech signal waveform based on the observed acoustical features of the signal and the known relation between acoustic features and phonetic sounds. This acoustic-phonetic approach has been the subject of research for almost 50 years, but has not resulted in much success in practice (cf. Fundamentals of Speech Recognition, L. Rabiner & B.H. Juang, Prentice-Hall). Problems abound. For example, it is known in the speech recognition art that even in a speech waveform plot, "it is often difficult to distinguish a weak, unvoiced sound (like "f" or "th") from silence, or a weak, voiced sound (like "v" or "m") from unvoiced sounds or even silence" and there are large variations depending on the identity of the closely neighboring phonetic units, the so-called coarticulation of sounds (ibid.). After the decoding, the determination of the word in the acoustic-phonetic approach is attempted by use of the so-called phoneme lattice, which represents a sequential set of phonemes that are likely matches to spoken input. The vertical position of a phoneme in the lattice is a measure of the goodness of the acoustic match to the phonetic unit ("lexical access"). But "the real problem with the acoustic-phonetic approach to speech recognition is the difficulty in getting a reliable phoneme lattice for the lexical access stage" (ibid.); that is, it is almost impossible to label an utterance accurately because of the large variations inherent in any language.

[0003] In the pattern-recognition approach, a knowledge base of versions of a given speech pattern is assembled ("training"), and recognition is achieved through comparison of the input speech pattern with the speech patterns in the knowledge base to determine the best match. The paradigm has four steps: (1) feature measurement using spectral analysis, (2) pattern training to produce reference patterns for an utterance class, (3) pattern classification to compare unknown test patterns with the class reference pattern by measuring the spectral "distance" between two well-defined spectral vectors and aligning the time to compensate for the different rates of speaking of the two patterns (dynamic time warping, DTW), and (4) decision logic whereby similarity scores are utilized to select the best match. Pattern recognition requires heavy computation, particularly for steps (2) and (3), and computation for large numbers of sound classes often becomes prohibitive. Therefore, systems relying on the human voice for information input, because of the inherent vagaries of speech (including homophones, word similarity, accent, sound level, syllabic emphasis, speech pattern, background noise, and so on), require considerable signal processing power and large look-up table databases in order to attain even minimal levels of speech recognition accuracy. Mainframe computers and high-end workstations are beginning to approach acceptable levels of voice recognition, but even with the memory and computational power available in present personal computers (PCs), voice recognition for those machines is so far largely limited to given sets of specific voice commands. For devices with far less memory and processing power than PCs, such as PDAs, mobile phones, toys, and entertainment devices, accurate general speech recognition has been hitherto impossible. For example, a typical voice-activated cellular phone allows preprogramming by reciting a name and then entering an associated number. When the user subsequently recites the name, a microprocessor in the cell phone will attempt to match the recited name's voice pattern with the stored number. As anyone who has used present day voice-activated cell phones knows, the match is sometimes inaccurate (due to inconsistent pronunciation, background noise, and inherent limitations due to lack of processing power) and only about 25 stored numbers are possible. In PDA devices, it is necessary for device manufacturers to perform extensive redesign to achieve even very limited voice recognition (for example, present PDAs cannot search a database in response to voice input).

[0004] As for different ways of speech input, spell mode utterances have problems with the confusable sets: {A,J,K}, {B,C,D,E,F,P,T,V,Z}, {Q,U}, {I,Y}, and {F,S,X}. These can generally only be discriminated based upon a small, critical portion of the utterance. Since conventional recognition relies on a simple accumulated distortion score over the entire utterance duration (a binary "yes" or "no"), it does not place sufficient emphasis on the critical parts, resulting in poor recognition accuracy. An obvious approach would be to weight the critical portions, but this method has not achieved high recognition accuracy and carries a heavy computational burden.

[0005] In sum, the memory and computation necessary for accurate and speedy voice recognition also require increased electrical power and complex operating systems; all of these carry increased cost. Thus present speech recognition technology is not feasible for hand-held information devices because of the former's weight, electrical power, complexity, and cost requirements.

[0006] Of particular present day interest is mobile Internet access; that is, communication through mobile phones, PDAs, and other hand-held electronic devices to the Internet. The Wireless Application Protocol (WAP) specification is intended to define an open, standard architecture and set of protocols for wireless Internet access. The Wireless Markup Language (WML) is defined by WAP as a presentation standard for mobile Internet applications. WML is a modified subset of the Web markup language Hypertext Markup Language (HTML), scaled appropriately to meet the physical constraints and data capabilities of present day mobile devices, for example the Global System for Mobile (GSM) phones. Typically, the HTML served by a Web site passes through a WML gateway to be scaled and formatted for the hand-held device. First phase WAP services deliver existing Web content to mobile devices, but in the future, Web content could be created specifically for mobile users, delivering information such as weather, stock quotes, banking services, e-commerce/business, navigation aids, and the like. There are some commercially available products already, such as the Nokia 7110, the Ericsson MC218, and the Motorola Timeport. The demand for mobile wireless Internet access is expected to explode. Ericsson Mobile Communications predicts that by 2004, there will be one billion mobile Internet users. But efficient mobile Internet access will require new technologies. There are data rate improvements on the horizon, such as General Packet Radio Service (GPRS), Enhanced Data Rates for GSM Evolution (EDGE), and the Third Generation Universal Mobile Telecommunications System (3G-UMTS). In particular, UMTS promises (in 2002) wideband data rates up to 2 megabits/second (over 200 times the 9.6 kilobit data rate of current GSM phones). But however much the transmission rates and bandwidth increase, the content is reduced or compressed, and the display features modified to efficiently display information, the vexing problem of information input and transmission at the mobile device end has not been solved. Conventional speech-to-Internet communication requires the computational power and memory requirements of at least present-day personal computers (PCs) to perform the transmission of voice packets to the Internet Service Provider (ISP) servers utilizing the so-called Voice over Internet Protocol (VoIP). Even with such computing power and memory available, VoIP allows only limited recognition and accuracy. Further, conventional server-based speech recognition systems (for example, produced by the companies Nuance and SpeechWorks) can only provide service to fewer than ten users per server. Thus, for 100,000 putative users (not a particularly large number considering the number of present-day mobile phone users), 10,000 servers are needed, making such speech recognition economically unfeasible. The problem is thus scalability. For PC to server Internet voice communication, databases are typically downloaded from the server to the PC client (for example by the company Conversa), but the size of the database makes this method prohibitive for mobile devices.

[0007] The present invention is a speech communication system and method for utilization on a communications network system, such as the Internet, comprising a plurality of acoustic recognizers embedded in the mobile electronic communication devices for recognizing speech information and generating a first set of associated language information, and a plurality of linguistic recognizers embedded in data processing devices, such as servers, for recognizing said first set of associated language information and generating a second set of associated language information, thereby more accurately recognizing the speech information in a distributed speech recognition processing manner.

Brief Description of the Drawings

[0008] 

Figure 1 is a block diagram of the personalized database according to the present invention.

Figure 2 is a block diagram of the speech recognition system according to the invention.

Figure 3 is a block diagram of an LPC front-end processor according to the present invention.

Figure 4 is a block diagram of the letter speech recognition system according to the present invention.

Figure 5 is an example of a waveform for a letter as generated by the microphone according to the present invention.

Figure 6 is the dynamic time warping initialization flowchart procedure for calculating the Total Distortion cepstrum according to the present invention.

Figure 7 is the dynamic time warping iteration procedure flowchart for calculating the Total Distortion cepstrum according to the present invention.

Figure 8 is the dynamic time warping flowchart for calculating the relative values of the Total Distortion cepstrum according to the present invention.

Figure 9 is a block diagram of the system architecture of a cellular phone having an embodiment of the present invention embedded therein.

Figure 10 illustrates the word recognition performance results of one embodiment of the present invention as compared to the prior art systems ART and Sensory.

Figure 11 is a diagram of a preferred embodiment of the present invention utilizing acoustic recognizers at the mobile device end and linguistic recognizers at the server end.

Figure 12 is a diagram of a preferred embodiment of the present invention utilizing

Detailed Description of Embodiments of the Invention

[0009] Figure 1 is a block diagram of the personalized database 100 of the present invention. A microphone 101 receives an audio voice string (in one embodiment, a series of letters or characters) and converts the voice string into an electronic waveform signal. A front-end signal processor 102 processes the waveform to produce a parametric representation of the waveform suitable for recognition and comparison. In the preferred embodiment, the voice string is processed by linear predictive coding (LPC), producing a parametric representation for each letter or character (so-called "feature extraction") which removes redundant information from the waveform data to describe more distinctly each audio signal. The result, for example for letters of the English alphabet, is a 26 x 26 matrix wherein columns hold the parametric representations of each letter and the rows will hold inputted speech letters. In the present invention, the matrix is a "calibration template" consisting of the individual user's pronunciation of the letters stored in pronunciation database 103. Because voice inputs are calibrated by the calibration template, typical speech recognition inaccuracies are avoided in aggregated utterance (e.g., word) comparisons. A sample calibration table is appended for reference at pages 24-26.

[0010] Figure 2 is a block diagram of the preferred embodiment of the invention. The microphone 101 receives a sequence of inputted utterances which are transmitted to the front-end signal processor 102 to form a parameterized voice string waveform set which is then compared with the pronunciation database 103 using an utterance comparator 201 to select the best match for the individual utterances (e.g., letters or characters). As an example, suppose the name "Michael" is inaccurately pronounced "n-y-d-h-a-b-l" (the errors presumably due to confusable pronunciations of letters). In one embodiment, letter comparator 201 accepts the voice string and determines the "distance" between the voice string utterances and the calibration template in pronunciation database 103 by testing the seven letters in the example against all the letters in pronunciation database 103. In another embodiment, similarly pronounced letters (or any sounds) are grouped based on similarity, so the comparison is more efficient. Aggregated utterance similarity comparator 202 compares the calibrated letter series waveform to the entries in a prerecorded vocabulary database 203. In the example, even though the word may still not be accurately voice spelled, because there are only a limited number of sensical words such as "Michael", the chance of an accurate word match is considerably increased. In the preferred embodiment of the invention, vocabulary database 203 is a dictionary database available from the assignee of this invention, VerbalTek, Inc. Another embodiment of this invention advantageously utilizes a dictionary database from Motorola entered into vocabulary database 203. Still another embodiment of this invention utilizes address book entries by the user. The present invention contemplates word dictionaries consisting of any terms which are desired by the user for vocabulary database 203. For example, specialized words for specific areas of endeavor (commercial, business, service industry, technology, academic, and all professions such as legal, medical, accounting, and so on) can be advantageously entered into vocabulary database 203. Further, the present invention contemplates advantageous utilization for monosyllabic word languages such as Chinese, wherein the individual utterances (Chinese characters) when aggregated into character strings become more distinct. Through comparison of the pre-recorded waveforms in vocabulary database 203 with the inputted waveforms, a sequential set of phonemes is generated that are likely matches to the spoken input, and a phoneme lattice is generated. The lattice is constructed by assigning each inputted waveform a "score" value based upon the closeness of each inputted combination to a word in vocabulary database 203. The "closeness" index is based upon a calculated distortion between the input waveform and the stored vocabulary waveforms, thereby generating "distortion scores". Since the scores are based on relatively accurate (compared with traditional speech recognition acoustic-phoneme methods) matches of letters or characters, the phoneme lattice produces word matches at 95% and above accuracy. The best matches for the words are then displayed on display 204.
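The two-stage comparison described above (per-letter matching followed by a word-level match against the vocabulary) can be illustrated with a minimal sketch. Everything in it is an assumption for illustration only: the function names, the confusable groups, and the toy distance values 0.0 / 0.3 / 1.0 stand in for the calibrated distortion scores that the comparators 201 and 202 would actually compute.

```python
# Toy sketch: word-level matching from per-letter distortion scores.
# The groups and numeric distances below are hypothetical, not from the source.
CONFUSABLE_GROUPS = [set("mn"), set("iy"), set("bcdegptvz"), set("ajk")]

def letter_distance(a, b):
    """Return a toy distortion between two spoken letters."""
    if a == b:
        return 0.0
    if any(a in g and b in g for g in CONFUSABLE_GROUPS):
        return 0.3   # easily confused letters are "close"
    return 1.0       # unrelated letters are "far"

def word_distortion(spoken, candidate):
    """Accumulate letter distortions; words of different length never match."""
    if len(spoken) != len(candidate):
        return float("inf")
    return sum(letter_distance(s, c) for s, c in zip(spoken, candidate))

def best_word(spoken, vocabulary):
    """Pick the vocabulary entry with the lowest accumulated distortion."""
    return min(vocabulary, key=lambda w: word_distortion(spoken, w))

print(best_word("nydhabl", ["michael", "michele", "rachael"]))
```

Because only a limited number of vocabulary entries are plausible, the misheard string "nydhabl" still resolves to "michael" in this toy setup, which mirrors the argument made in the paragraph.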

[0011] In the preferred embodiment of the invention, the front-end signal processing to convert the speech waveform (an example of which is shown in Figure 5) to a parametric representation utilizes linear predictive coding (LPC). LPC is particularly suited for the present invention because (1) LPC is more effective for the voiced regions of the speech spectral envelope than for the unvoiced regions, and the present invention advantageously utilizes individual letter or character utterances which emphasize the distinctive letter or character sounds and have natural pauses (so that the unvoiced regions are less significant), and (2) LPC offers a simplified computation and an economical representation that takes into consideration vocal tract characteristics (thereby allowing personalized pronunciations to be achieved with minimal processing and storage). The particular efficacy of LPC in the present invention is illustrated, for example, in the LPC autocorrelation method, where it is assumed that the speech segment is identically zero outside of a given interval (tantamount to multiplying the speech signal by a finite length window), so the unvoiced regions are not well represented. In the LPC transfer function $H(z) = S(z) / (G\,U(z))$, the gain G of the source is estimated from the speech signal and the scaled source is used as input to a digital filter H(z), which is controlled by the vocal tract parameters characteristic of the speech being produced.

[0012] Figure 3 is a block diagram of an LPC front-end processor 102 according to the preferred embodiment of the invention. A preemphasizer 301, which preferably is a fixed low-order digital system (typically a first-order FIR filter), spectrally flattens the signal s(n), and is described by:

$$H(z) = 1 - a z^{-1}$$ (Eqn 1)

where $0.9 \le a \le 1.0$.
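In the time domain, the first-order filter of Eqn 1 corresponds to the difference equation s'(n) = s(n) - a s(n-1). The sketch below illustrates this; the function name, the default a = 0.95, and the convention that the sample before the first one is zero are assumptions, not details given in the text.

```python
def preemphasize(signal, a=0.95):
    """Apply H(z) = 1 - a*z^-1, i.e. out[n] = s[n] - a*s[n-1].

    Assumes s[-1] = 0, so the first sample passes through unchanged.
    """
    out = []
    previous = 0.0
    for sample in signal:
        out.append(sample - a * previous)
        previous = sample
    return out

# A constant (low-frequency) input is strongly attenuated after the first sample.
flat = preemphasize([1.0, 1.0, 1.0], a=0.95)
```

A constant input shrinks to roughly 1 - a after the first sample, which is the spectral flattening effect: low frequencies are suppressed relative to high ones.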

[0013] In another embodiment of the invention, preemphasizer 301 is a first-order adaptive system having the transfer function

$$H(z) = 1 - a_n z^{-1}$$ (Eqn 2)

where $a_n$ changes with time (n) according to a predetermined adaptation criterion, for example, $a_n = r_n(1) / r_n(0)$.
[0014] Frame blocker 302 frame blocks the speech signal in frames of N samples, with adjacent frames being separated by M samples. In this embodiment of the invention, N = M = 160 when the sampling rate of the speech is 8 kHz, corresponding to 20 msec frames with no separation between them. Each frame yields one feature vector of 12 parameters, so that a one second utterance (50 frames long) generates a 50 x 12 matrix (the template feature set).
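The frame arithmetic above can be checked with a short sketch. The function name and the choice to drop a trailing partial frame are assumptions; the values N = M = 160 and the 8 kHz sampling rate are taken from the paragraph.

```python
def frame_block(signal, frame_len=160, frame_shift=160):
    """Split a signal into frames of frame_len samples, frame_shift apart.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += frame_shift
    return frames

SAMPLE_RATE_HZ = 8000
one_second = [0.0] * SAMPLE_RATE_HZ
frames = frame_block(one_second)
# 8000 samples / 160 samples per frame = 50 frames of 20 msec each
print(len(frames), 1000 * 160 / SAMPLE_RATE_HZ)
```

With 12 parameters extracted per frame, these 50 frames give the 50 x 12 template feature set described in the text.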

[0015] Windower 303 windows each individual frame to minimize the signal discontinuities at the beginning and end of each frame. Autocorrelator 304 performs autocorrelation giving

$$r_l(m) = \sum_{n=0}^{N-1-m} x_l(n)\, x_l(n+m)$$ (Eqn 3)

where m = 0, 1, ..., p, and p is the highest autocorrelation lag (the order of the LPC analysis). The preferred embodiment of this invention uses p = 10, but values of p from 8 to 16 can also be advantageously used. The zeroth autocorrelation is the frame energy of a given frame. Cepstral coefficient generator 305 converts each frame into cepstral coefficients (the coefficients of the Fourier transform representation of the log magnitude spectrum, refer below) using Durbin's method, which is known in the art. Tapered windower 306 weights the cepstral coefficients in order to minimize the effects of noise. Tapered windower 306 is chosen to lower the sensitivity of the low-order cepstral coefficients to overall spectral slope and the high-order cepstral coefficients to noise (or other undesirable variability). Temporal differentiator 307 generates the first time derivative of the cepstral coefficients, preferably approximated by an orthogonal polynomial fit (in this embodiment, a least-squares estimate of the derivative over a finite-length window), to produce processed signal S'(n). In another embodiment, the second time derivative can also be generated by temporal differentiator 307 using approximation techniques known in the art to provide further speech signal information and thus improve the representation of the spectral properties of the speech signal. Yet another embodiment skips the temporal differentiator to produce signal S°(n). It is understood that the above description of the front-end signal processor 102 using LPC and the above-described techniques are for disclosing the preferred embodiment, and that other techniques and methods of front end signal processing can be advantageously employed in the present invention. The comparison techniques and methods for matching strings of utterances, be they individual characters or words, are substantially similar, so the following description encompasses both comparators 201 and 202.
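The chain from a windowed frame to cepstral coefficients can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not name the window, so a Hamming window is assumed; the function names are hypothetical; and the LPC-to-cepstrum step uses the standard textbook recursion for an all-pole model, with the gain term c_0 omitted.

```python
import math

def hamming(frame):
    """Taper a frame (length > 1) with a Hamming window (assumed window type)."""
    n_samples = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1)))
            for n, x in enumerate(frame)]

def autocorrelate(frame, p):
    """Eqn 3: r(m) = sum_{n=0}^{N-1-m} x(n) x(n+m), for m = 0..p."""
    n_samples = len(frame)
    return [sum(frame[n] * frame[n + m] for n in range(n_samples - m))
            for m in range(p + 1)]

def durbin(r, p):
    """Levinson-Durbin recursion: autocorrelations -> predictor coefficients.

    Returns ([a_1..a_p], residual_energy) for the model s(n) ~ sum a_k s(n-k).
    """
    a = [0.0] * (p + 1)
    err = r[0]
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients a_1..a_p to cepstral coefficients c_1..c_n."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)          # c[0] (gain term) is left unused
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]

# Example: 12 cepstral coefficients from one 160-sample frame with p = 10.
frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(160)]
r = autocorrelate(hamming(frame), 10)
coeffs, residual = durbin(r, 10)
features = lpc_to_cepstrum(coeffs, 12)
```

For a first-order model with a_1 = 0.5 the recursion gives c_m = 0.5^m / m, which is a convenient sanity check on the conversion step.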
[0016] In the preferred embodiment of the present invention, the parametric representation utilizes cepstral coefficients and the inputted speech is compared with the letter or word string entries in the prerecorded databases by comparing cepstral distances. The inputted letters (or letters in word combination) generate a number of candidate character (or letter) matches which are ranked according to similarity. In the comparison of the pre-recorded waveforms with the input waveforms, a sequential set of phonemes that are likely matches to the spoken input is generated which, when ordered in a matrix, produces a phoneme lattice. The lattice is ordered by assigning each inputted waveform a "score" value based upon the closeness of each inputted combination to a word in the vocabulary database. The "closeness" index is based upon the cepstral distance between the input waveform and the stored vocabulary waveforms, thereby generating "distortion scores". Since the scores are based on relatively accurate (compared with traditional speech recognition acoustic-phoneme methods) matches of characters, the phoneme lattice of this invention produces word matches at 95% and above accuracy.

[0017] Figure 4 shows the waveform parametric representation inputted to letter calibrator 401 wherein, in conjunction with pronunciation database 103, a calibration matrix (example appended) is generated. Distortion calculator 402 calculates the distortion between the inputted speech and the entries in pronunciation database 103 based on, in the preferred embodiment, the calculated cepstral distances (described below). Scoring calculator 403 then assigns scores based on predetermined criteria (such as cepstral distances) and selector 404 selects the candidate letter (word).

[0018] The distance between two speech spectra on a log magnitude versus frequency scale is

$$V(\omega) = \log S(\omega) - \log S'(\omega)$$ (Eqn 4)

[0019] To represent the dissimilarity between two speech feature vectors, the preferred embodiment utilizes the mean absolute of the log magnitude difference (versus frequency) or the root mean squared (rms) log spectral distortion (or "distance") measure, both drawn from the set of norms

$$d(S, S')^p = \int_{-\pi}^{\pi} |V(\omega)|^p \, \frac{d\omega}{2\pi}$$ (Eqn 5)

where when p = 1, this is the mean absolute log spectral distortion and when p = 2, this is the rms log spectral distortion.
[0020] In the preferred embodiment, the distance or distortion measure is represented by the complex cepstrum of a signal, which is defined as the Fourier transform of the log of the signal spectrum. For a power spectrum which is symmetric with respect to $\omega = 0$ and is periodic for a sampled data sequence, the Fourier series representation of $\log S(\omega)$ is

$$\log S(\omega) = \sum_{n=-\infty}^{\infty} c_n e^{-jn\omega}$$ (Eqn 6)

where $c_n = c_{-n}$ are the cepstral coefficients, with

$$c_0 = \int_{-\pi}^{\pi} \log S(\omega) \, \frac{d\omega}{2\pi}$$ (Eqn 7)

The rms log spectral distance can then be written in terms of the cepstral coefficients:

$$d(S, S')^2 = \int_{-\pi}^{\pi} |\log S(\omega) - \log S'(\omega)|^2 \, \frac{d\omega}{2\pi} = \sum_{n=-\infty}^{\infty} (c_n - c_n')^2$$ (Eqn 8)

where $c_n$ and $c_n'$ are the cepstral coefficients of $S(\omega)$ and $S'(\omega)$, respectively. By not summing infinitely, for example 10-30 terms in the preferred embodiment, the present invention utilizes a truncated cepstral distance. This efficiently (meaning relatively lower computation burdens) estimates the rms log spectral distance. Since the perceived loudness of a speech signal is approximately logarithmic, the choice of log spectral distance is well suited to discern subjective sound differences. Furthermore, the variability of low cepstral coefficients is primarily due to vagaries of speech and transmission distortions, thus the cepstrum (set of cepstral distances) is advantageously selected for the distortion measure.
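A truncated cepstral distance of the kind described above can be sketched in a few lines. The function name and the default of 12 terms are assumptions; the sketch sums over positive indices n = 1..L only, which is one common truncation convention and leaves out the c_0 (gain) term.

```python
def truncated_cepstral_distance(c, c_prime, n_terms=12):
    """Squared truncated cepstral distance: sum_{n=1}^{L} (c_n - c'_n)^2.

    c and c_prime hold c_1, c_2, ... for the two spectra; only the first
    n_terms entries (or fewer, if the vectors are shorter) are used.
    """
    limit = min(n_terms, len(c), len(c_prime))
    return sum((c[n] - c_prime[n]) ** 2 for n in range(limit))

# Identical feature vectors have zero distortion; differences add up per term.
d = truncated_cepstral_distance([1.0, 2.0, 3.0], [1.0, 0.0, 3.0])
```

Limiting the sum to a few dozen terms is what keeps the per-frame comparison cheap enough for the low-power devices the document targets.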

[0021] Different acoustic renditions of the same utterance are often spoken at different time rates, so speaking rate variation and duration variation should not contribute to a linguistic dissimilarity score. Dynamic time warper (DTW) 408 performs the dynamic behavior analysis of the spectra to more accurately determine the dissimilarity between the inputted utterance and the matched database value. DTW 408 time-aligns and normalizes the speaking rate fluctuation by finding the "best" path through a grid mapping the acoustic features of the two patterns to be compared. In the preferred embodiment, DTW 408 finds the best path by a dynamic programming minimization of the dissimilarities. Two warping functions, $\phi_x$ and $\phi_y$, relate two temporal fluctuation indices, $i_x$ and $i_y$ respectively, of the speech pattern to a common time axis, k, so that

$$i_x = \phi_x(k), \quad k = 1, 2, \ldots, T$$
$$i_y = \phi_y(k), \quad k = 1, 2, \ldots, T$$ (Eqn 9)

[0022] A global pattern dissimilarity measure is defined, based on the warping function pair, as the accumulated distortion over the entire utterance:

$$d_\phi(X, Y) = \sum_{k=1}^{T} d(\phi_x(k), \phi_y(k))\, m(k) / M_\phi$$ (Eqn 10)

where $d(\phi_x(k), \phi_y(k))$ is a short-time spectral distortion defined for $x_{\phi_x(k)}$ and $y_{\phi_y(k)}$, $m(k)$ is a nonnegative weighting function, $M_\phi$ is a normalizing factor, and T is the "normal" duration of two speech patterns on the normal time scale. The path $\phi = (\phi_x, \phi_y)$ is chosen so as to measure overall path dissimilarity with consistency. In the preferred embodiment of the present invention, the dissimilarity d(X,Y) is defined as the minimum of $d_\phi(X,Y)$ over all paths, i.e.,

$$d(X, Y) = \min_{\phi} d_\phi(X, Y)$$ (Eqn 11)



[0023] The above definition is accurate when X and Y are utterances of the same word because minimizing the accumulated distortion along the alignment path means the dissimilarity is measured based on the best possible alignment to compensate for speaking rate differences. It is known in the art that dynamic programming can solve sequential decision problems such as that described immediately above by finding the optimal path, meaning the minimum "cost" of moving from one point to another point. In one embodiment of the present invention, since the number of steps involved in the move is determined by "if-then" statements, the sequential decision is asynchronous. The decision utilizes a recursion relation that allows the optimal path search to be conducted incrementally and is performed by an algorithm in the preferred embodiment of the present invention as described below. The decision rule for determining the next point in an optimal path (the "policy"), together with the destination point, completely defines the cost which is sought to be minimized. The optimal policy for a move from the initial point 1 to an intermediate point j, followed by a move from j to i incurring a cost $\zeta(j, i)$, is given by

$$\varphi(1, i) = \min_{j} [\varphi(1, j) + \zeta(j, i)]$$ (Eqn 12)

so for the optimal sequence of moves and associated minimum cost from a point i to a point j,

$$\varphi(i, j) = \min_{l} [\varphi(i, l) + \varphi(l, j)]$$ (Eqn 13)

[0024] In another embodiment, the sequential decision is synchronous (a regular decision process with a fixed number of moves, M), and the associated minimum cost $\varphi_m(i, l)$ satisfies

$$\varphi_{m+1}(i, n) = \min_{l} [\varphi_m(i, l) + \zeta(l, n)]$$ (Eqn 14)

which is the recursion relation used in an embodiment of the present invention.

[0025] In both of the embodiments described above, the method follows the steps of (1) initialization, (2) recursion, (3) termination, and (4) backtracking as follows:

Initialization:

$$\varphi_1(i, n) = \zeta(i, n)$$
$$\xi_1(n) = i, \quad \text{for } n = 1, 2, \ldots, N$$

Recursion:

$$\varphi_{m+1}(i, n) = \min_{l} [\varphi_m(i, l) + \zeta(l, n)]$$
$$\xi_{m+1}(n) = \arg\min_{l} [\varphi_m(i, l) + \zeta(l, n)], \quad \text{for } n = 1, 2, \ldots, N \text{ and } m = 1, 2, \ldots, M-2$$

Termination:

$$\varphi_M(i, j) = \min_{l} [\varphi_{M-1}(i, l) + \zeta(l, j)]$$
$$\xi_M(j) = \arg\min_{l} [\varphi_{M-1}(i, l) + \zeta(l, j)]$$

Path Backtracking:

optimal path = $(i, i_1, i_2, \ldots, i_{M-1}, j)$,
where $i_m = \xi_{m+1}(i_{m+1})$, $m = M-1, M-2, \ldots, 1$,
with $i_M = j$.
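The four steps above (initialization, recursion, termination, backtracking) can be sketched for the synchronous case as follows. This is an illustrative sketch only: the function name, the 0-based point indices, and the representation of the one-move cost as a square list-of-lists `zeta[l][n]` are assumptions, and ties are broken in favor of the lowest index.

```python
def synchronous_dp(zeta, i, j, M):
    """Cheapest route from point i to point j in exactly M moves (M >= 2).

    zeta[l][n] is the cost of a single move from point l to point n.
    Returns (minimum_cost, path) where path has M + 1 points.
    """
    n_points = len(zeta)
    # Initialization: phi_1(i, n) = zeta(i, n), xi_1(n) = i
    phi = [zeta[i][n] for n in range(n_points)]
    back = [[i] * n_points]
    # Recursion for m = 1 .. M-2, producing phi_{m+1} and xi_{m+1}
    for _ in range(1, M - 1):
        new_phi, xi = [], []
        for n in range(n_points):
            best = min(range(n_points), key=lambda l: phi[l] + zeta[l][n])
            new_phi.append(phi[best] + zeta[best][n])
            xi.append(best)
        phi = new_phi
        back.append(xi)
    # Termination: phi_M(i, j) and xi_M(j)
    last = min(range(n_points), key=lambda l: phi[l] + zeta[l][j])
    cost = phi[last] + zeta[last][j]
    # Backtracking: i_m = xi_{m+1}(i_{m+1}), starting from i_M = j
    path = [j, last]
    for m in range(M - 2, 0, -1):      # back[m] stores xi_{m+1}
        path.append(back[m][path[-1]])
    path.append(i)
    path.reverse()
    return cost, path

costs = [[0, 1, 5],
         [1, 0, 1],
         [5, 1, 0]]
print(synchronous_dp(costs, 0, 2, 2))
```

Only one row of costs and one row of back-pointers per move needs to be kept, which is why the procedure is economical enough for small devices.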

[0026] The above algorithm is economical in the computing sense and thus suitable for implementation in the hand-held devices contemplated by this invention.

[0027] Figures 6, 7, and 8 constitute a flow chart of the preferred embodiment of DTW 408 for computing the Total Distortion between templates to be compared. The "distance" d(i,j) (Eqn 11 above) is the distortion between the i-th feature of template X and the j-th feature of template Y. Figure 6 depicts the initialization procedure 601 wherein the previous distance is d(0,0) at 602. The index j is then incremented at 603 and the previous distance now is the distance at j (prev_dist[j]), which is equal to prev_dist[j-1] + d(0,j). At 605, if j is less than the number of features in template Y (j < numY), then j will be incremented at 606 and fed back to 604 for a new calculation of prev_dist[j]. If j is not less than numY, then the initialization is complete and the Iteration Procedure 611 for the Total Distortion begins as shown in Figure 7. At 612, index i is set at one and the current distance (curr_dist[0]) is calculated as prev_dist[0] plus d(i,0). At 614, j is set to one and the possible paths leading to an associated distance d1, d2, or d3 are calculated as:

curr_dist[j-1] + d(i,j) = d1

prev_dist[j] + d(i,j) = d2

prev_dist[j-1] + d(i,j) = d3.


[0028] The relative values of the associated distances are then tested at 621 and 622 in Figure 8. If d3 is not greater than d1 and not greater than d2, then d3 is the minimum and curr_dist[j] will be d3 at 623. After testing for the j-th feature to be less than the number of features in the Y template at 626, j is incremented at 617 and fed back to the calculation of distances of possible paths and the minimization process recurs. If d2 is greater than d1 and d3 is greater than d1, then d1 is the minimum and is thus set as curr_dist[j]. Then j is again tested against the number of features in the Y template at 626, j is incremented at 617 and fed back for recursion. If d3 is greater than d2 and d1 is greater than d2, then d2 is the minimum and is set as curr_dist[j], and the like process is repeated to be incremented and fed back. In this way, the minimum distance is found. If j is greater than or equal to the number of features in template Y at 626, then i is tested to see if it is equal to the number of features in template X minus 1. If i is not equal to the number of features in template X minus 1, then the previous distance is set as the current distance for the j indices (up to numY-1) at 618, i is incremented at 616 and fed back to 613 for the setting of the current distance as the previous distance plus the new i-th distance, and the process is repeated for every i up to the time i equals the number of features in template X minus 1. If i is equal to the number of features in the X template minus 1, then the Total Distortion is calculated at 628 as

\[ \text{Total Distortion} = \frac{\text{curr\_dist}[\text{numY}-1]}{\text{numX} + \text{numY} - 1} \]

thus completing the algorithm for finding the total distortion.
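
The flow chart of Figures 6 to 8 can be sketched in Python as below. This is a hedged illustration rather than the embodiment itself: the features are plain numbers with an absolute-difference distance standing in for the cepstral distance d(i,j), the name `total_distortion` is chosen here, and the final normalization by numX + numY - 1 is an assumption because that part of the formula is difficult to read in the source.

```python
from typing import Callable, Optional, Sequence


def total_distortion(x: Sequence[float], y: Sequence[float],
                     d: Optional[Callable[[float, float], float]] = None) -> float:
    """Two-row dynamic time warping between templates x and y."""
    if not x or not y:
        raise ValueError("templates must be non-empty")
    if d is None:
        d = lambda a, b: abs(a - b)  # stand-in for the cepstral distance
    num_x, num_y = len(x), len(y)

    # Initialization (Figure 6): accumulate along the first row, i = 0.
    prev_dist = [0.0] * num_y
    prev_dist[0] = d(x[0], y[0])
    for j in range(1, num_y):
        prev_dist[j] = prev_dist[j - 1] + d(x[0], y[j])

    # Iteration (Figures 7 and 8): only two rows are held in memory.
    for i in range(1, num_x):
        curr_dist = [0.0] * num_y
        curr_dist[0] = prev_dist[0] + d(x[i], y[0])
        for j in range(1, num_y):
            local = d(x[i], y[j])
            d1 = curr_dist[j - 1] + local   # horizontal move
            d2 = prev_dist[j] + local       # vertical move
            d3 = prev_dist[j - 1] + local   # diagonal move
            curr_dist[j] = min(d1, d2, d3)
        prev_dist = curr_dist

    # Assumed normalization by the maximum path length.
    return prev_dist[num_y - 1] / (num_x + num_y - 1)
```

Keeping only `prev_dist` and `curr_dist` means the working memory grows with the length of template Y alone, which suits the small scratch pad of a hand-held device.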
[0029] To achieve optimum recognition accuracy, the warping functions are constrained. It is known in the art that even small speech endpoint errors result in significant degradation in speech detection accuracy. In carefully-enunciated speech in controlled environments, high detection accuracy is attainable, but for general use (such as in mobile phones), the vagaries of the speaker sounds (including lip smacks, breathing, clicking sounds, and so on), background noise, and transmission distortion (cross-talk, intermodulation distortion, and tonal interference) make accurate endpoint detection difficult. If the utterances have well-defined endpoints (marking the beginning and ending frames of the pattern), similarity comparisons will result in more accurate recognition. The present invention, in its utilization of individual characters (e.g., letters) for input utterance, achieves accuracy as a result of the generally more precise enunciation of individual characters (letters) and the typical pauses between individual characters (letters) when a character string is enunciated or a word is spelled. Temporal variations thus are generally confined to the region within the endpoints. The constraints for the warping function are thus simply setting the values at the beginning and ending points as the first and last temporal fluctuation indices, \(i_x = 1\) and \(i_y = T\). These endpoint constraints are incorporated into the present invention through Eqn (11), in terms of \(T_x\) and \(T_y\):



\[ M_\varphi \, d(X, Y) \triangleq D(T_x, T_y) = \min_{\varphi_x, \varphi_y} \sum_{k=1}^{T} d(\varphi_x(k), \varphi_y(k)) \, m(k) \qquad \text{(Eqn 15)} \]



where X and Y terminate at \(T_x\) and \(T_y\) respectively.

[0030] The preferred embodiment of the present invention provides a dynamic time warping regime that is optimally suited for the individual character (e.g., spelling) input utterance speech recognition system of the present invention. DTW 408 utilizes Eqn 15 to generate the minimum partial accumulated distortion along a path connecting (1,1) and \((i_x, i_y)\) as:

\[ D(i_x, i_y) \triangleq \min_{\varphi_x, \varphi_y, T'} \sum_{k=1}^{T'} d(\varphi_x(k), \varphi_y(k)) \, m(k) \qquad \text{(Eqn 16)} \]

where \(\varphi_x(T') = i_x\) and \(\varphi_y(T') = i_y\), and the dynamic programming recursion with constraints becomes

\[ D(i_x, i_y) = \min_{(i_x', i_y')} \left[ D(i_x', i_y') + \zeta((i_x', i_y'), (i_x, i_y)) \right] \qquad \text{(Eqn 17)} \]

where \(\zeta\) is the weighted accumulated distortion (local distance) between points \((i_x', i_y')\) and \((i_x, i_y)\),

\[ \zeta((i_x', i_y'), (i_x, i_y)) = \sum_{l=0}^{L_s} d(\varphi_x(T'-l), \varphi_y(T'-l)) \, m(T'-l) \qquad \text{(Eqn 18)} \]

with \(L_s\) being the number of moves in the path from \((i_x', i_y')\) to \((i_x, i_y)\) according to \(\varphi_x\) and \(\varphi_y\). The incremental distortion \(\zeta\) is evaluated only along the paths defined by the various constraints, thus the minimization process can be effectively solved within the constraints. However, the heuristic nature of dissimilarity can also be advantageously included in the method; for example, in this invention a frame is cut into multiple segments to distinguish between confusable letter
utterances, such as "a" and T. It is understood that many different constraints and combinations of constraints are 
within the scope of the present invention. In the utterances of different letters, for instance, the time-alignment for the 
most accurate comparison is not a well-defined linguistic concept so intuitive constraints are utilized in the present invention.
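
The constrained recursion of Eqn 17 can be illustrated with a small Python sketch. The constraint set `MOVES` and its weights (1, 1, 2) are hypothetical choices made here for illustration, not constraints stated in the description; each allowed move is a single step, so the incremental distortion reduces to one weighted local distance.

```python
import math
from typing import Callable, Sequence, Tuple

# Hypothetical local constraints: (step in x, step in y, weight m).
MOVES: Tuple[Tuple[int, int, float], ...] = ((1, 0, 1.0), (0, 1, 1.0), (1, 1, 2.0))


def constrained_dtw(x: Sequence[float], y: Sequence[float],
                    dist: Callable[[float, float], float],
                    moves: Sequence[Tuple[int, int, float]] = MOVES) -> float:
    """Accumulated distortion D(T_x, T_y) under a set of allowed moves."""
    tx, ty = len(x), len(y)
    D = [[math.inf] * ty for _ in range(tx)]
    D[0][0] = dist(x[0], y[0])  # beginning endpoint: the path starts at (1, 1)
    for ix in range(tx):
        for iy in range(ty):
            if ix == 0 and iy == 0:
                continue
            local = dist(x[ix], y[iy])
            for dx, dy, weight in moves:
                px, py = ix - dx, iy - dy
                if px >= 0 and py >= 0:
                    # D(ix, iy) = min over predecessors of D(px, py) + zeta
                    D[ix][iy] = min(D[ix][iy], D[px][py] + weight * local)
    return D[tx - 1][ty - 1]  # ending endpoint: the path ends at (T_x, T_y)
```

Changing `moves` changes which paths are admissible without touching the recursion, which is one way to experiment with different combinations of constraints.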

[0031] The preferred embodiment of the present invention, due to its small form factor, allows simple integration into existing operating systems (for example, Microsoft Windows CE® for PDAs and ARM7TDMI for cell phones) of consumer electronic devices, thereby obviating extensive redesign and reprogramming. An embodiment of the present invention's speech recognition programs also may be loaded into the flash memory of a device such as a mobile phone or PDA, thus allowing easy, quick, and inexpensive integration of the present invention into existing electronic devices, thereby making redesign or reprogramming of the DSP of the host device unnecessary. Further, the speech recognition programs may be loaded into the memory by the end-user through a data port coupled to the flash memory. This can be achieved also through a download from the Internet. Thus the present invention can be easily installed in mobile devices for communication with the Internet through the Internet itself.

[0032] Figure 9 illustrates the system architecture of a cellular phone with an embodiment of the present invention embedded therein. Flash memory 901 is coupled to microprocessor 902 which in turn is coupled to DSP processor 903, which in conjunction with flash memory 901 and microprocessor 902, performs the speech recognition described above. Read-Only-Memory (ROM) device 904 and Random Access Memory (RAM) device 905 service DSP processor 903 by providing memory storage and access for pronunciation database 104 and vocabulary database 203. Speech input through microphone 907 is coded by coder/decoder (CODEC) 906. After speech recognition by DSP processor 903, the speech signal is decoded by CODEC 906 and transmitted to speaker 908 for audio confirmation (in one embodiment of the present invention). Alternatively, speaker 908 can be a visual display. As an example of one embodiment of the application protocol interface (API) of the present invention, the specifications, utilizing ARM7TDMI as base, are: memory usage requires a code size of 10 KB, a scratch pad size of 4 KB, and storage (per template) of 0.5 KB; computational requirements are 1.9 MIPS for speech feature extraction and 0.5 MIPS for speech recognition per template. Speech recognition error performance results and the computational power estimates of one embodiment of the present invention (VerbalTek™) are shown in Figure 10, where comparisons are made with speech recognition systems from the companies ART, Sensory, and Parrot. The present invention achieves error percentage results that are significantly lower than those of products of companies (ART and Sensory) which require only "small" computational power (MIPS), and accuracy comparable to that of Parrot, which requires relatively "huge" computational power.
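
As a rough worked example of the figures quoted above, the following Python sketch estimates the footprint for a given number of stored templates. It assumes, which the description does not state, that storage and recognition cost both scale linearly with the number of templates and that every template is compared for each utterance; the function name `footprint` is hypothetical.

```python
from typing import Tuple

# Figures quoted in the description for the ARM7TDMI-based embodiment.
CODE_KB = 10.0
SCRATCH_KB = 4.0
TEMPLATE_KB = 0.5
FEATURE_EXTRACTION_MIPS = 1.9
RECOGNITION_MIPS_PER_TEMPLATE = 0.5


def footprint(num_templates: int) -> Tuple[float, float]:
    """Return (memory in KB, processing load in MIPS) under linear scaling."""
    memory_kb = CODE_KB + SCRATCH_KB + TEMPLATE_KB * num_templates
    mips = FEATURE_EXTRACTION_MIPS + RECOGNITION_MIPS_PER_TEMPLATE * num_templates
    return memory_kb, mips


# 26 letters plus 10 digits, one template each (an illustrative choice).
memory_kb, mips = footprint(36)
print(f"{memory_kb:.1f} KB, {mips:.1f} MIPS")
```

Under these assumptions a 36-template alphanumeric set needs about 32 KB of memory and about 19.9 MIPS.
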
[0033] The present invention thus can be advantageously used for WAP voice commands for Internet communication, 
e-mail messages, and voice access of large numbers of phonebook entries. 

[0034] Distributed data processing can be described in terms of client-server systems wherein each performs some processing and memory storage with the bulk of processing and storage being done at the server. The present invention is the separation of a language-dependent processor and database at the client end and a language-independent processor and database at the server end of a client-server system for voice information communication. An example is a PDA, mobile phone, or other electronic device as client and an Internet Service Provider (ISP) server as server. Because of decreased computational power requirements, the present invention overcomes the scalability problem of the prior art.

[0035] Figure 11 shows the preferred embodiment of the client-based language-dependent speech recognition portion and the server-based language-independent speech recognition portion in the distributed processing scheme according to the present invention. Taken together, the two portions can perform accurate speech recognition for mobile electronic device communication with a server, for example an ISP server. Mobile devices 1101, 1102, 1103, ... each include an acoustic recognizer 1110, 1111, 1112, ... respectively, which can be customized to the user's speech patterns and vagaries (for example, by utilizing pronunciation database 103 and utterance comparator 201, as described above). Servers 1104, 1105, 1106, ... each include linguistic recognizers 1107, 1108, 1109, ... respectively, which perform the bulk of the speech recognition (for example, by utilizing vocabulary database 203 and aggregated utterances similarity comparator 202, as described above). Server 1104 can be based at one website and server 1105 can be based at another website, and so on. Because of lowered computational power requirement, one server can serve many clients. The linguistic recognizers 1107, 1108, 1109, ... at each website server 1104, 1105, 1106, ... can be particularized in their ability to recognize speech according to the nature of the website; for example, specialized commercial, technical, medical terminology and the like can be more accurately recognized by specialized (or more comprehensive pronunciation variations) entries in vocabulary database 203.
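
The division of labor in Figure 11 can be sketched as two small Python classes. Everything in this sketch is an illustrative assumption: the class names, the use of a single number per utterance in place of a cepstral feature vector, the mismatch count used as word distortion, and the direct method call standing in for transmission over the Internet.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AcousticRecognizer:
    """Client side: holds the user's own pronunciation templates."""
    templates: Dict[str, float]  # letter -> stored feature (scalar stand-in)

    def recognize(self, utterances: List[float]) -> str:
        # Pick the closest stored letter for every spoken utterance.
        return "".join(
            min(self.templates, key=lambda c: abs(self.templates[c] - u))
            for u in utterances
        )


@dataclass
class LinguisticRecognizer:
    """Server side: holds a vocabulary particular to one website."""
    vocabulary: List[str]

    def recognize(self, letters: str) -> str:
        def distortion(word: str) -> float:
            if len(word) != len(letters):
                return float("inf")
            return sum(a != b for a, b in zip(word, letters))
        return min(self.vocabulary, key=distortion)


client = AcousticRecognizer({"s": 1.0, "x": 1.4, "e": 2.0, "a": 3.0, "t": 4.0})
server = LinguisticRecognizer(["seat", "stat", "scan"])
letters = client.recognize([1.3, 2.1, 2.9, 4.2])  # first set of language information
word = server.recognize(letters)                   # second set of language information
```

Only the short letter string crosses from client to server, so each server can hold a different vocabulary while the client stays unchanged.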

[0036] Figure 12 shows another embodiment of the present invention in a mobile electronic devices-to-Internet Service Provider servers system. Mobile devices 1201, 1202, 1203, ... each include speech recognition systems 1210, 1211, 1212, ... respectively. Servers 1220, 1221, 1222, ... include word string databases 1231, 1232, 1233, ... respectively, which can recognize the word inputs from mobile devices 1201, .... In this embodiment, the bulk of the speech recognition is performed at the mobile device and is a complete system (for example, including pronunciation database 103, utterance comparator 201, and aggregated utterances similarity comparator 202, as described above). In a preferred embodiment of speech recognizers 1210, 1211, 1212, ..., speech is inputted through a microphone that converts the acoustic signal into electronic signals which are parameterized and compared with pronunciation database 103. The best matches based on predetermined criteria (such as cepstral distances) are selected, the selections are aggregated, and then transmitted through the Internet. Web servers 1220, 1221, 1222, ... receive the transmitted aggregated utterances and compare them with entries in databases 1231, 1232, 1233, .... The best matches are selected utilizing predetermined criteria (such as cepstral distances) and the speech input is thereby recognized. All of the capabilities and features described in the general speech recognition description above can be incorporated into the distributed speech recognition systems illustrated in Figures 11 and 12, but any speech recognition system and/or method can be advantageously utilized in the present invention.

[0037] In operation, a user may use the speaker-independent input default mode whereby a prepackaged character (letter) database for speech recognition is used. To create (by "training") personalized database 100, a user records alphabet and numeral sounds by dictating into the system from (for an English example) "a" to "z" and "0" to "9" in a natural voice with a pause of at least 0.2 seconds between individual sounds, thereby generating a "voice string". In one embodiment of the invention, if the letters are run-on, the endpoint detection scheme described above will detect indistinguishable utterances, and the user will be instructed through the display to cease recitation and repeat the dictation from the beginning. The individual letters of the voice string are converted using front-end signal processor 102 which produces a waveform for each letter (such as that shown in Figure 5). The waveforms are then segmented, assigned an address in memory and then stored in memory so that each utterance is mapped into pronunciation database 104 (a process termed "labeling"). Pronunciation database 103 in conjunction with utterance similarity comparator 201 forms, for letters of the English alphabet, a 26 x 26 matrix that has columns for the stored waveforms for each letter in pronunciation database 104 and rows for the inputted speech letters for recognition analysis (a sample matrix is appended). Utterance similarity comparator 201 compares the inputted utterance with all the letters in the columns (pronunciation database 103) to find the best match. For example, the inputted word "seat" will be spelled by the user, "s-e-a-t". Because of the vagaries of pronunciation, background noise, and other factors, the letters may be recognized as "x-e-k-d" (each of which is similar in pronunciation to the desired letter, and therefore mistakenly "recognized"). In the preferred embodiment of this invention, before the comparison with letters in the database is made, letters with similar pronunciations are grouped together so that the search is more efficient (the search matrix dimensions will be smaller than a 26 x 26 matrix for the English alphabet). For example, the grouping in the preferred embodiment of this invention emphasizes the vowel sounds of the syllables and has been found to significantly reduce the similarity computation, thereby making the grouping ideal for hand-held devices. This grouping assigns "a", "j", and "k" to the same group; "x", "s", and "f" to another group; and "b", "c", "d", "e", and "g" to yet another group. As an illustration of the distortion scoring technique, in the "s-e-a-t" example, the first letter "s" is initially recognized as "x" so there will be a non-zero distortion score assigned based on the cepstral distance (e.g., 2.0); the next letter "e" is correctly recognized, so the score will be 0; the next letter "a" is recognized as "k" which is assigned a score of 1.5; the last letter "t" is recognized as "d" which is assigned a score of 1.0. The total distortion score for the word is 4.5. The distortion scores are then compared in combination with the words in vocabulary database 203. The selected candidate letters, in combination however, are more distinct (and "xekd" does not exist as a word). Word similarity comparator 202 computes a distortion score using the above-described techniques, so that an inputted "xekd" will produce distortion scores with words as follows:



Input Word    Candidates    Letter Scores       Distortion Score    Similarity %
xekd          seat          S1+S2+S3+S4 = S     1200                89%
              feat          T1+T2+T3+T4 = T     2380                75%
              heat          U1+U2+U3+U4 = U     4530                68%
              beat          V1+V2+V3+V4 = V     8820                42%
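
The letter grouping described in paragraph [0037] can be sketched as a two-stage search in Python. The three groups are those named in the description; how a group is first selected is not described there, so comparing against the first member of each group as its representative is purely an assumption of this sketch, as are the scalar features and the name `grouped_match`.

```python
from typing import Callable, Dict, Sequence, Tuple

# Groups named in the description (only three of the groups are listed there).
GROUPS: Tuple[Tuple[str, ...], ...] = (
    ("a", "j", "k"),
    ("x", "s", "f"),
    ("b", "c", "d", "e", "g"),
)


def grouped_match(feature: float, templates: Dict[str, float],
                  groups: Sequence[Sequence[str]] = GROUPS,
                  dist: Callable[[float, float], float] = lambda a, b: abs(a - b)) -> str:
    """Return the best-matching letter using a coarse-then-fine search."""
    # Stage 1 (assumed): compare only against one representative per group.
    best_group = min(groups, key=lambda g: dist(feature, templates[g[0]]))
    # Stage 2: full comparison, restricted to the members of that group.
    return min(best_group, key=lambda c: dist(feature, templates[c]))
```

For the eleven letters above, a lookup costs at most 3 + 5 = 8 comparisons instead of 11, and the saving grows once all 26 letters are grouped.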



[0038] Word similarity comparator 202 ranks the distortion scores of each comparison to determine the lowest distortion score, which is the closest match (or greatest similarity) with a word in vocabulary database 203. Display 204 displays the selected word (or individual letter) for confirmation by the user. Any alphanumeric display device, for example a liquid crystal display (LCD), may be advantageously utilized. For uses in mobile phones or PDAs, the combination of letters then constitutes the word, which then can be matched to the telephone number or other transmission index for transmission.
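
The ranking step can be illustrated with the "xekd" example. In the Python sketch below, the letter scores 2.0, 1.5, and 1.0 are the ones given in the description; the score for the pair "x"/"f" and the default penalty for all other mismatches are hypothetical values introduced only so that the example runs, and the function names are likewise chosen here.

```python
from typing import Dict, FrozenSet, List, Tuple

# Letter-pair distortion scores; the last entry is a hypothetical value.
PAIR_SCORE: Dict[FrozenSet[str], float] = {
    frozenset("xs"): 2.0,  # from the description
    frozenset("ka"): 1.5,  # from the description
    frozenset("dt"): 1.0,  # from the description
    frozenset("xf"): 2.5,  # assumed for illustration
}
DEFAULT_SCORE = 5.0        # assumed penalty for any other mismatch


def letter_score(recognized: str, candidate: str) -> float:
    if recognized == candidate:
        return 0.0
    return PAIR_SCORE.get(frozenset((recognized, candidate)), DEFAULT_SCORE)


def rank_words(recognized: str, vocabulary: List[str]) -> List[Tuple[float, str]]:
    """Return (distortion, word) pairs sorted from best to worst match."""
    scored = [
        (sum(letter_score(r, c) for r, c in zip(recognized, word)), word)
        for word in vocabulary
        if len(word) == len(recognized)
    ]
    return sorted(scored)


ranking = rank_words("xekd", ["beat", "heat", "feat", "seat"])
best_score, best_word = ranking[0]
```

With these values "seat" ranks first with a distortion of 4.5, matching the total computed in the description, even though none of the four recognized letters except "e" was correct on its own.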

[0039] Although many of the examples in this description are based on the English alphabet, it is understood that they are exemplary only and that the scope of the present invention includes other languages as well, the only restriction being that such language is based on distinguishable sounds. In fact, an embodiment of the present invention provides multiple-language capability since any language's speech recognition can be achieved by the present invention because it is primarily dependent upon the contents of the utterance and vocabulary databases, which can be changed for any language. Similarity comparisons with the spoken sounds and the vocabulary database can be accomplished by the present invention as described above and accuracy can be achieved through the user's dictation of sounds to construct the pronunciation database 104.

[0040] In typical use, the present invention allows voice recognition to be achieved in 1-2 seconds, thereby providing a smooth interface with the user. Accuracy in test results has been consistently at the 95% level.

[0041] It is particularly noted herein that the present invention is ideal for inputting the monosyllabic character-based letters (or words) of the Chinese language. For example, the words for "mobile phone" are transliterated as a character string "xing-dong-dian-hua". Each word is monosyllabic and has its own meaning (or several meanings), but in aggregate they comprise the unique term for "mobile phone". The present invention provides a highly accurate recognition (in part due to heuristic calibration) of individual monosyllabic letters, which when taken in aggregate to form a word, produces even more accurate recognition because of the limited number of sensical choices.

[0042] While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. For example, the present invention is suitable for any verbal language that can be separated into utterances: alphabetical languages where the utterances are associated with letters of an alphabet (such as English and Russian) and symbolic languages where the utterances are associated with characters (such as Chinese and Japanese). Further, any speech recognition system or technique can be advantageously utilized. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.
Claims

1. In a communications network system having a plurality of mobile electronic communication devices mutually communicable with a plurality of data processing devices, an acoustic information recognition system comprising:

an acoustic recognizer electrically disposed on each of the mobile communication devices for recognizing the acoustic information and generating a first set of associated language information; and

a linguistic recognizer electrically disposed on each of the data processing devices for recognizing said first set of associated language information and generating a second set of associated language information.

2. The acoustic information transmission system of claim 1 wherein said acoustic recognizer comprises:

a front-end signal processor for parameterizing the acoustic information;

an utterance pronunciation database storage device for storing a plurality of parametric representations of utterance pronunciations; and

utterance similarity comparator means, coupled to said front-end signal processing means and to said pronunciation database storage means, for comparing the parametric representation of the acoustic information with said plurality of parametric representations of utterance pronunciations, and selecting a first sequence of associations between said parametric representations of the acoustic information and at least one of said plurality of parametric representations of utterance pronunciations responsive to predetermined criteria.

3. The acoustic information transmission system of claim 2 wherein said linguistic recognizer comprises:

a vocabulary database storage device for storing a plurality of parametric representations of word string pronunciations; and

an aggregated utterances similarity comparator, coupled to said acoustic recognizer and to said vocabulary database storage device, for comparing said first sequence of associations with said plurality of parametric representations of aggregated utterance pronunciations stored in said vocabulary database storage device, and selecting a second sequence of associations between said selected parametric representations of the aggregated utterance pronunciations with at least one of said plurality of parametric representations of word string pronunciations responsive to predetermined criteria.

4. A distributed speech information communication system, communicable with the Internet, comprising:

a plurality of mobile electronic communication devices, communicable with the Internet, each including a microphone for converting the acoustic signals into electronic signals;

a plurality of acoustic recognizers, being coupled one-to-one to said microphones, said plurality of acoustic recognizers each having a stored database comprising utterance pronunciations, for converting the electronic signals into utterance information and comparing said utterance information with said utterance pronunciations, selecting at least one of said utterance pronunciations, aggregating said selected utterance pronunciations, and transmitting said selected aggregated utterance pronunciation through the Internet;

a plurality of data processing devices, communicable with the Internet, for receiving said selected aggregated utterance pronunciations through the Internet;

a plurality of linguistic recognizers, being coupled one-to-one to said plurality of data processing devices, said plurality of linguistic recognizers each having a stored database comprising word string pronunciations, for comparing said aggregated utterance pronunciations with said word string pronunciations, and selecting at least one of said word string pronunciations, thereby recognizing the language information.

5. In a communications network system having a plurality of mobile electronic communication devices each having disposed therein an utterance pronunciation database, the mobile electronic communication devices being mutually communicable with a plurality of data processing devices each having disposed therein a word string pronunciation database, a method for recognizing acoustic information comprising the steps of:

(a) parameterizing the acoustic information;

(b) comparing the parameterized acoustic information with the utterance pronunciations in the utterance pronunciation database;

(c) selecting at least one of the utterance pronunciations responsive to predetermined criteria;

(d) aggregating the selected at least one of the utterance pronunciations;

(e) comparing the selected at least one of the utterance pronunciations with the word string pronunciations in the word string pronunciation database; and

(f) selecting at least one of the word string pronunciations responsive to predetermined criteria, thereby recognizing the acoustic information.

6. The method of claim 5 wherein the acoustic information, the utterance pronunciations, and the word string pronunciations are in the Chinese language.

7. The method of claim 5 wherein the acoustic information, the utterance pronunciations, and the word string pronunciations are in the Japanese language.

8. The method of claim 5 wherein step (a) comprises utilizing cepstral coefficients to parameterize the acoustic information.

9. The method of claim 5 wherein the predetermined criteria of step (c) are calculations of cepstral distances.

10. The method of claim 5 wherein the predetermined criteria of step (f) are the calculation of cepstral distances.

11. In a communications network system having a plurality of mobile electronic communication devices, the mobile electronic communication devices being mutually communicable with a plurality of data processing devices, a method for recognizing acoustic information comprising the steps of:

(a) parameterizing and storing utterances in each of the mobile electronic communication devices to comprise an utterance pronunciation database;

(b) parameterizing and storing word string pronunciations in each of the data processing devices to comprise a word string pronunciation database;

(c) parameterizing and storing the acoustic information in at least one of the mobile electronic communication devices;

(d) comparing the parameterized acoustic information with the utterance pronunciations in the utterance pronunciation database in the mobile communication device;

(g) selecting at least one of the utterance pronunciations responsive to predetermined criteria;

(h) aggregating the selected at least one of the utterance pronunciations;

(i) comparing the selected at least one of the utterance pronunciations with the word string pronunciations in the word string pronunciation database in the data processing device; and

(j) selecting at least one of the word string pronunciations responsive to predetermined criteria, thereby recognizing the acoustic information.

12. The method of claim 11 wherein the acoustic information, the utterance pronunciations, and the word string pronunciations are in the Chinese language.

13. The method of claim 11 wherein the acoustic information, the utterance pronunciations, and the word string pronunciations are in the Japanese language.

14. The method of claim 11 wherein the parameterization of steps (a), (b), and (c) comprises utilizing cepstral coefficients.

15. The method of claim 11 wherein the predetermined criteria of steps (g) and (j) are calculations of cepstral distances.
Fig. 6.